Consistent Estimation of the Number of Unseen Elements
Abstract
We observe a sample of n tokens from a large corpus of written text and note the occurrence of N distinct word types. We then ask how many word types in the population from which the sample was drawn remain unseen. The commonly used LNRE (large number of rare events) regime suggests a natural estimator of the number of unseen word types in the population from the relatively small sample of size n. We show that this nonparametric estimator is statistically consistent in the LNRE regime.

1 The Problem

We observe $(X_1, \ldots, X_n)$, an i.i.d. sequence drawn according to a probability distribution P from a large, but finite, alphabet Ω. In several problems of practical interest, such as those in natural language and speech, the observation size n is small compared to the number of letters in Ω that have a non-vanishing probability of occurring. Our goal is to estimate the "essential" size of the alphabet Ω using only the relatively small number n of observations.

Estimating the number of unseen species, a common problem in ecological studies (Fisher et al. (1943) study the problem of estimating the number of Malayan butterfly species), and estimating the number of unseen words in a corpus (Efron and Thisted (1976) study the number of words Shakespeare knew but did not use), serve as motivating examples for the scenario we focus on.

Over the long history of this problem, several estimators, parametric and non-parametric, have been proposed (Gandolfi and Sastri, 2004). The parametric estimators fit an empirical Bayes model to the observed data and thereby estimate the number of unseen elements. For instance, the study of Efron and Thisted (1976) supposes that the underlying probability distribution is a Poisson mixture, and the work of Fisher et al. (1943) further assumes that the mixing is by a gamma distribution.
Several non-parametric estimators have also been proposed (Gandolfi and Sastri, 2004), but none has been shown to be consistent. In this paper, we propose a non-parametric estimator of the number of unseen elements, motivated by the characteristic property of word frequency distributions, the Large Number of Rare Events (Baayen, 2001). We also demonstrate that the estimator is strongly consistent under a natural scaling formulation described in (Khmaladze, 1987).

1.1 A Scaling Formulation

Our main interest is in probability distributions P with the property that all the letters in the alphabet Ω are unlikely, i.e., the chance that any given letter eventually appears in an arbitrarily long observation is strictly between 0 and 1. The authors of (Baayen, 2001; Khmaladze and Chitashvili, 1989; Wagner et al., 2006) propose a natural scaling formulation to study this problem; specifically, (Baayen, 2001) has a tutorial-like summary of the theoretical work in (Khmaladze, 1987; Khmaladze and Chitashvili, 1989). In particular, the authors consider a sequence of alphabets and probability distributions, indexed by the observation size n. Specifically, the observation $(X_1, \ldots, X_n)$ is drawn i.i.d. from an alphabet $\Omega_n$ according to probability $P_n$. If the probability of a letter, say $\omega \in \Omega_n$, is $p_n$, then the probability that this specific letter $\omega$ does not occur in an observation of size n is

$$(1 - p_n)^n. \tag{1}$$

For $\omega$ to be an unlikely letter, we would like this probability to remain strictly between 0 and 1 for large n. This implies that $p_n$ is $\Theta(1/n)$, i.e.,

$$\frac{\check{c}}{n} \le p_n \le \frac{\hat{c}}{n}, \tag{2}$$

for some constants $0 < \check{c} < \hat{c} < \infty$. We will assume throughout this paper that $\check{c}$ and $\hat{c}$ are the same for every letter $\omega \in \Omega_n$. This implies that the size of the alphabet is growing linearly with the observation size:
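One way to see the linear growth: the probabilities of the letters in the alphabet sum to 1, so bounding each letter's probability between č/n and ĉ/n forces the alphabet size to lie between n/ĉ and n/č. The sketch below illustrates the scaling numerically. It is not the paper's estimator, and it makes assumptions of its own: a uniform distribution over an alphabet of size n/c (so that č = ĉ = c), the value c = 2, and the hypothetical function names `prob_unseen` and `simulate_unseen_fraction`. It checks that the probability of a letter with probability c/n going unseen in n draws approaches exp(-c), and that the simulated fraction of unseen letters is close to the same value.

```python
import math
import random


def prob_unseen(c: float, n: int) -> float:
    """Probability that a letter with probability c/n never occurs in n i.i.d. draws."""
    return (1.0 - c / n) ** n


def simulate_unseen_fraction(c: float, n: int, seed: int = 0) -> float:
    """Draw n tokens uniformly from an alphabet of size round(n / c)
    (an illustrative assumption) and return the fraction of letters never observed."""
    rng = random.Random(seed)
    alphabet_size = round(n / c)  # alphabet grows linearly with n
    seen = {rng.randrange(alphabet_size) for _ in range(n)}
    return 1.0 - len(seen) / alphabet_size


if __name__ == "__main__":
    c = 2.0  # illustrative constant, not taken from the paper
    for n in (100, 10_000, 100_000):
        # Columns: n, exact (1 - c/n)^n, simulated unseen fraction, limit exp(-c)
        print(n, prob_unseen(c, n), simulate_unseen_fraction(c, n), math.exp(-c))
```

In this uniform setting a constant fraction of the alphabet stays unseen no matter how large n becomes, which is the defining feature of the regime described above.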
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Publication date: 2007